Abstract

We introduce Selective Greedy Equivalence Search (SGES), a
restricted version of Greedy Equivalence Search (GES). SGES
retains the asymptotic correctness of GES but, unlike GES, has
polynomial performance guarantees. In particular, we show that
when data are sampled independently from a distribution that is
perfect with respect to a DAG G defined over the observable
variables then, in the limit of large data, SGES will identify
G's equivalence class after a number of score evaluations that is
(1) polynomial in the number of nodes and (2) exponential in
various complexity measures including maximum-number-of-parents,
maximum-clique-size, and a new measure called v-width that is at
least as small as, and potentially much smaller than, the other
two. More generally, we show that for any hereditary and
equivalence-invariant property Π known to hold in G, we retain
the large-sample optimality guarantees of GES even if we ignore
any GES deletion operator during the backward phase that results
in a state for which Π does not hold in the common-descendants
subgraph.

1 INTRODUCTION 

Greedy Equivalence Search (GES) is a score-based search algorithm
that searches over equivalence classes of Bayesian-network
structures. The algorithm is appealing because (1) for finite
data, it explicitly (and greedily) tries to maximize the score of
interest, and (2) as the data grows large, it is guaranteed, under
suitable distributional assumptions, to return the generative
structure. Although empirical results show that the algorithm is
efficient in real-world domains, the number of search states that
GES needs to evaluate in the worst case can be exponential in the
number of domain variables.


In this paper, we show that if we assume the generative
distribution is perfect with respect to some DAG G defined over
the observable variables, and if G is known to be constrained by
various graph-theoretic measures of complexity, then we can
disregard all but a polynomial number of the backward search
operators considered by GES while retaining the large-sample
guarantees of the algorithm; we call this new variant of GES
selective greedy equivalence search or SGES. Our complexity
results are a consequence of a new understanding of the backward
phase of GES, in which edges (either directed or undirected) are
greedily deleted from the current state until a local minimum is
reached. We show that for any hereditary and equivalence-invariant
property known to hold in generative model G, we can remove from
consideration any edge-deletion operator between X and Y for which
the property does not hold in the resulting induced subgraph over
X, Y, and their common descendants. As an example, if we know that
each node has at most k parents, we can remove from consideration
any deletion operator that results in a common child with more
than k parents.

We define a new notion of complexity that we call v-width. For a
given generative structure G, v-width is necessarily smaller than
the maximum clique size, which is necessarily smaller than or
equal to the maximum number of parents per node. By casting
limited v-width and other complexity constraints as graph
properties, we show how to enumerate directly over a polynomial
number of edge-deletion operators at each step, and we show that
we need only a polynomial number of calls to the scoring function
to complete the algorithm.

The main contributions of this paper are theoretical. Our
definition of the new SGES algorithm deliberately leaves
unspecified the details of how to implement its forward phase; we
prove our results for SGES given any implementation of this phase
that completes with a polynomial number of calls to the scoring
function. A naive implementation is to immediately return a
complete (i.e., no independence) graph using no calls to the
scoring function, but this choice is unlikely to be reasonable in
practice, particularly in discrete domains where the sample
complexity of this initial model will likely be a problem.
Although we believe it is an important direction, our paper does
not explore practical alternatives for the forward phase that have
polynomial-time guarantees.

This paper, which is an expanded version of Chickering and Meek
(2015) and includes all proofs, is organized as follows. In
Section 2, we describe related work. In Section 3, we provide
notation and background material. In Section 4, we present our new
SGES algorithm, we show that it is optimal in the large-sample
limit, and we provide complexity bounds when given an
equivalence-invariant and hereditary property that holds on the
generative structure. In Section 5, we present a simple synthetic
experiment that demonstrates the value of restricting the backward
operators in SGES. We conclude with a discussion of our results in
Section 6.

2 RELATED WORK 

It is useful to distinguish between approaches to learning the
structure of graphical models as constraint based, score based or
hybrid. Constraint-based approaches typically use (conditional)
independence tests to eliminate potential models, whereas
score-based approaches typically use a penalized likelihood or a
marginal likelihood to evaluate alternative model structures;
hybrid methods combine these two approaches. Because score-based
approaches are driven by a global likelihood, they are less
susceptible than constraint-based approaches to incorrect
categorical decisions about independences.

There are polynomial-time algorithms for learning the best model
in which each node has at most one parent. In particular, the
Chow-Liu algorithm (Chow and Liu, 1968) used with any
equivalence-invariant score will identify the highest-scoring
tree-like model in polynomial time; for scores that are not
equivalence invariant, we can use the polynomial-time
maximum-branching algorithm of Edmonds (1967) instead. Gaspers et
al. (2012) show how to learn k-branchings in polynomial time;
these models are polytrees that differ from a branching by a
constant number k of edge deletions.

Without additional assumptions, most results for learning
non-tree-like models are negative. Meek (2001) shows that finding
the maximum-likelihood path is NP-hard, despite this being a
special case of a tree-like model. Dasgupta (1999) shows that
finding the maximum-likelihood polytree (a graph in which each
pair of nodes is connected by at most one path) is NP-hard, even
with bounded in-degree for every node. For general directed
acyclic graphs, Chickering (1996) shows that finding the highest
marginal-likelihood structure under a particular prior is NP-hard,
even when each node has at most two parents. Chickering et al.
(2004) extend this same result to the large-sample case.

Researchers often assume that the training-data “generative”
distribution is perfect with respect to some model class in order
to reduce the complexity of learning algorithms. Geiger et al.
(1990) provide a polynomial-time constraint-based algorithm for
recovering a polytree under the assumption that the generative
distribution is perfect with respect to a polytree; an analogous
score-based result follows from this paper. The constraint-based
PC algorithm of Spirtes et al. (1993) can identify the equivalence
class of Bayesian networks in polynomial time if the generative
structure is a DAG model over the observable variables in which
each node has a bounded degree; this paper provides a similar
result for a score-based algorithm. Kalisch and Bühlmann (2007)
show that for Gaussian distributions, the PC algorithm can
identify the right structure even when the number of nodes in the
domain is larger than the sample size. Chickering (2002) uses the
same DAG-perfectness-over-observables assumption to show that the
greedy GES algorithm is optimal in the large-sample limit,
although the branching factor of GES is worst-case exponential;
the main result of this paper shows how to limit this branching
factor without losing the large-sample guarantee. Chickering and
Meek (2002) show that GES identifies a “minimal” model in the
large-sample limit under a less restrictive set of assumptions.

Hybrid methods for learning DAG models use a constraint-based
algorithm to prune out a large portion of the search space, and
then use a score-based algorithm to select among the remaining
models (Friedman et al., 1999; Tsamardinos et al., 2006). Ordyniak
and Szeider (2013) give positive complexity results for the case
when the remaining DAGs are characterized by a structure with
constant treewidth.

Many researchers have turned to exhaustive enumeration to identify
the highest-scoring model (Gillispie and Perlman, 2001; Koivisto
and Sood, 2004; Silander and Myllymäki, 2006; Kojima et al.,
2010). There are many complexity results for other model classes.
Karger and Srebro (2001) show that finding the optimal Markov
network is NP-complete for treewidth > 1. Narasimhan and Bilmes
(2004) and Shahaf, Chechetka and Guestrin (2009) show how to learn
approximate limited-treewidth models in polynomial time. Abbeel,
Koller and Ng (2005) show how to learn factor graphs in polynomial
time.

3 NOTATION AND BACKGROUND 

We use the following syntactical conventions in this paper. 
We denote a variable by an upper case letter (e.g., A) and a state
or value of that variable by the same letter in lower case (e.g.,
a). We denote a set of variables by a bold-face capitalized letter
or letters (e.g., X). We use a corresponding bold-face lower-case
letter or letters (e.g., x) to denote an assignment of state or
value to each variable in a given set. We use calligraphic letters
(e.g., G, E) to denote statistical models and graphs.

A Bayesian-network model for a set of variables U is a pair
(G, θ). G = (V, E) is a directed acyclic graph, or DAG for short,
consisting of nodes in one-to-one correspondence with the
variables and directed edges that connect those nodes. θ is a set
of parameter values that specify all of the conditional
probability distributions. The Bayesian network represents a joint
distribution over U that factors according to the structure G.

The structure G of a Bayesian network represents the independence
constraints that must hold in the distribution. The set of all
independence constraints implied by the structure G can be
characterized by the Markov conditions, which are the constraints
that each variable is independent of its non-descendants given its
parents. All other independence constraints follow from properties
of independence. A distribution defined over the variables from G
is perfect with respect to G if the set of independences in the
distribution is equal to the set of independences implied by the
structure G.

Two DAGs G and G' are equivalent, denoted G ~ G', if the
independence constraints in the two DAGs are identical. Because
equivalence is reflexive, symmetric, and transitive, the relation
defines a set of equivalence classes over network structures. We
will use [G]~ to denote the equivalence class of DAGs to which G
belongs.

An equivalence class of DAGs F is an independence map (IMAP) of
another equivalence class of DAGs E if all independence
constraints implied by F are also implied by E. For two DAGs G and
H, we use G ≤ H to denote that [H]~ is an IMAP of [G]~; we use
G < H when G ≤ H and [H]~ ≠ [G]~.

As shown by Verma and Pearl (1991), two DAGs are equivalent if and
only if they have the same skeleton (i.e., the graph resulting
from ignoring the directionality of the edges) and the same
v-structures (i.e., pairs of edges X → Y and Y ← Z where X and Z
are not adjacent). As a result, we can use a partially directed
acyclic graph, or PDAG for short, to represent an equivalence
class of DAGs: for a PDAG P, the equivalence class of DAGs is the
set that share the skeleton and v-structures with P.
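
The Verma and Pearl characterization can be checked mechanically on small graphs. The following is a minimal sketch, assuming a DAG is stored as a dictionary that maps each node to its set of parents; the function names and the three example graphs are illustrative and not part of the paper.

```python
from itertools import combinations

def skeleton(parents):
    """Undirected edge set of a DAG given as {node: set_of_parents}."""
    return {frozenset((p, c)) for c, ps in parents.items() for p in ps}

def v_structures(parents):
    """Triples (X, Y, Z) with X -> Y <- Z where X and Z are not adjacent."""
    skel = skeleton(parents)
    return {
        (x, y, z)
        for y, ps in parents.items()
        for x, z in combinations(sorted(ps), 2)
        if frozenset((x, z)) not in skel
    }

def equivalent(dag1, dag2):
    """Same skeleton and same v-structures."""
    return (skeleton(dag1) == skeleton(dag2)
            and v_structures(dag1) == v_structures(dag2))

chain = {"A": set(), "B": {"A"}, "C": {"B"}}            # A -> B -> C
reversed_chain = {"A": {"B"}, "B": {"C"}, "C": set()}   # A <- B <- C
collider = {"A": set(), "B": {"A", "C"}, "C": set()}    # A -> B <- C
```

Under this check the two chains fall in one equivalence class, while the collider is alone in its class because it introduces a v-structure.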

We extend our notation for DAG equivalence and the DAG IMAP
relation to include the more general PDAG structure. In
particular, for a PDAG P, we use [P]~ to denote the corresponding
equivalence class of DAGs. For any pair of PDAGs P and Q, where
one or both may be a DAG, we use P ~ Q to denote [Q]~ = [P]~ and
we use P ≤ Q to denote [Q]~ is an IMAP of [P]~. To avoid
confusion, for the remainder of the paper we will reserve the
symbols G and H for DAGs.

¹We make the standard conditional-distribution assumptions of
multinomials for discrete variables and Gaussians for continuous
variables so that if two DAGs have the same independence
constraints, then they can also model the same set of
distributions.

²The definitions for the skeleton and set of v-structures for a
PDAG are the obvious extensions to these definitions for DAGs.

For any PDAG P and subset of nodes V, we use P[V] to denote the
subgraph of P induced by V; that is, P[V] has as nodes the set V
and has as edges all those from P that connect nodes in V. We use
NA_{X,Y} to denote, within a PDAG, the set of nodes that are
neighbors of X (i.e., connected with an undirected edge) and also
adjacent to Y (i.e., without regard to whether the connecting edge
is directed or undirected).

An edge in G is compelled if it exists in every DAG that is
equivalent to G. If an edge in G is not compelled, we say that it
is reversible. A completed PDAG (CPDAG) C is a PDAG with two
additional properties: (1) for every directed edge in C, the
corresponding edge in G is compelled and (2) for every undirected
edge in C the corresponding edge in G is reversible. Unlike
non-completed PDAGs, the CPDAG representation of an equivalence
class is unique.

We use Pa_Y^P to denote the parents of node Y in P. An edge X → Y
is covered in a DAG if X and Y have the same parents, with the
exception that X is not a parent of itself.
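
The covered-edge condition amounts to a single set comparison. A minimal sketch, assuming the same parent-set dictionary representation of a DAG (the helper name `is_covered` and the example graph are hypothetical):

```python
def is_covered(parents, x, y):
    """True if x -> y is an edge and Pa(y) equals Pa(x) plus x itself."""
    return x in parents[y] and parents[y] - {x} == parents[x]

# Complete DAG over three nodes: A -> B, A -> C, B -> C.
dag = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
```

In this example A → B and B → C are covered, whereas A → C is not, because C has the extra parent B that A lacks.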

3.1 Greedy Equivalence Search 


Algorithm GES(D)

Input : Data D
Output: CPDAG C

C ← FES(D)

C ← BES(D, C)

return C


Figure 1: Pseudo-code for the GES algorithm.

The GES algorithm, shown in Figure 1, performs a two-phase greedy
search through the space of DAG equivalence classes. GES
represents each search state with a CPDAG, and applies
transformation operators to this representation to traverse
between states. Each operator corresponds to a DAG edge
modification, and is scored using a DAG scoring function that we
assume has three properties. First, we assume the scoring function
is score equivalent, which means that it assigns the same score to
equivalent DAGs. Second, we assume the scoring function is locally
consistent, which means that, given enough data, (1) if the
current state is not an IMAP of G, the score prefers edge
additions that remove incorrect independences, and (2) if the
current state is an IMAP of G, the score prefers edge deletions
that remove incorrect dependences. Finally, we assume the scoring
function is decomposable, which means we can express it as:


$$\mathrm{Score}(\mathcal{G}, \mathbf{D}) = \sum_{i=1}^{n} \mathrm{Score}(X_i, \mathbf{Pa}_i^{\mathcal{G}}) \qquad (1)$$

Note that the data D is implicit in the right-hand side of
Equation 1. Most commonly used scores in the literature have these
properties. For the remainder of this paper, we assume they hold
for the scoring function we use.
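
Decomposability is what makes operator scoring cheap: a single-edge change affects only one term of the sum. The sketch below illustrates this with a deliberately artificial local score (a penalty of one per parent), which is an assumption made purely for illustration and not a score used in the paper.

```python
def total_score(parents, local_score):
    """Sum of per-node terms, in the form of Equation 1."""
    return sum(local_score(node, frozenset(ps)) for node, ps in parents.items())

def toy_local_score(node, parent_set):
    # Hypothetical stand-in for a real local score: one unit penalty per parent.
    return -1.0 * len(parent_set)

before = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
after = {"A": set(), "B": {"A"}, "C": {"B"}}   # edge A -> C deleted

# Difference of the two global scores ...
delta_global = total_score(after, toy_local_score) - total_score(before, toy_local_score)
# ... equals the difference of the two local terms for node C alone.
delta_local = (toy_local_score("C", frozenset({"B"}))
               - toy_local_score("C", frozenset({"A", "B"})))
```

Both differences agree, so an operator can be scored with two local evaluations instead of two full passes over the graph.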

All of the CPDAG operators from GES are scored using differences
in the DAG scoring function, and in the limit of large data, these
scores are positive precisely for those operators that remove
incorrect independences and incorrect dependences.

The first phase of GES, called forward equivalence search or FES,
starts with an empty (i.e., no-edge) CPDAG and greedily applies
GES insert operators until no operator has a positive score; these
operators correspond precisely to the union of all single-edge
additions to all DAG members of the current (equivalence-class)
state. After FES reaches a local maximum, GES switches to the
second phase, called backward equivalence search or BES, and
greedily applies GES delete operators until no operator has a
positive score; these operators correspond precisely to the union
of all single-edge deletions from all DAG members of the current
state.

Theorem 1. (Chickering, 2002) Let C be the CPDAG that results from
applying the GES algorithm to m records sampled from a
distribution that is perfect with respect to DAG G. Then in the
limit of large m, C ~ G.

The role of FES in the large-sample limit is only to identify a
state C for which G ≤ C; Theorem 1 holds for GES under any
implementation of FES that results in an IMAP of G. The
implementation details can be important in practice because what
constitutes a “large” amount of data depends on the number of
parameters in the model. In theory, however, we could simply
replace FES with a (constant-time) algorithm that sets C to be the
no-independence equivalence class.

The focus of our analysis in the next section is on a modified
version of BES, and the details of the delete operator used in
this phase are important. We detail the preconditions, scoring
function, and transformation algorithm for a delete operator in
Figure 2. We note that we do not need to make any CPDAG
transformations when scoring the operators; it is only once we
have identified the highest-scoring (non-negative) delete that we
need to make the transformation shown in the figure. After
applying the edge modifications described in the foreach loop, the
resulting PDAG P is not necessarily completed and hence we may
have to convert P into the corresponding CPDAG representation. As
shown by Chickering (2002), this conversion can be accomplished
easily by using the structure of P to extract a DAG that we then
convert into a CPDAG by undirecting all reversible edges. The
complexity of this procedure for a P with n nodes and e edges is
O(n · e), and requires no calls to the scoring function.


Operator: Delete(X, Y, H) applied to C

• Preconditions

    X and Y are adjacent
    H ⊆ NA_{Y,X}
    H̄ = NA_{Y,X} \ H is a clique

• Scoring

    Score(Y, {Pa_Y^C ∪ H̄} \ X) − Score(Y, X ∪ Pa_Y^C ∪ H̄)

• Transformation

    Remove edge between X and Y
    foreach H ∈ H do
        Replace Y − H with Y → H
        if X − H then Replace with X → H
    end
    Convert to CPDAG


Figure 2: Preconditions, scoring, and transformation algorithm for
a delete operator applied to a CPDAG.
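
The preconditions and scoring rule of Figure 2 translate into a few set operations. The sketch below is one possible reading, assuming a CPDAG stored as a dictionary with a node set, a set of undirected edges (frozensets) and a set of directed edges (ordered pairs); this representation, the helper names, and the example graph are assumptions for illustration only.

```python
from itertools import combinations

def adjacent(cpdag, a, b):
    return (frozenset((a, b)) in cpdag["undirected"]
            or (a, b) in cpdag["directed"] or (b, a) in cpdag["directed"])

def na(cpdag, y, x):
    """NA_{Y,X}: neighbors of y (undirected edge) that are adjacent to x."""
    return {v for v in cpdag["nodes"] if v not in (x, y)
            and frozenset((v, y)) in cpdag["undirected"]
            and adjacent(cpdag, v, x)}

def is_clique(cpdag, nodes):
    return all(adjacent(cpdag, a, b) for a, b in combinations(sorted(nodes), 2))

def delete_is_valid(cpdag, x, y, h):
    """Preconditions of Delete(X, Y, H): adjacency, H within NA, H-bar a clique."""
    shared = na(cpdag, y, x)
    return adjacent(cpdag, x, y) and h <= shared and is_clique(cpdag, shared - h)

def delete_score(local_score, y, x, pa_y, h_bar):
    """Score(Y, {Pa_Y u H-bar} minus X) - Score(Y, X u Pa_Y u H-bar)."""
    return (local_score(y, frozenset((pa_y | h_bar) - {x}))
            - local_score(y, frozenset(pa_y | h_bar | {x})))

# Hypothetical state: X - Y, with A and B each joined to both X and Y, no A - B edge.
cpdag = {
    "nodes": {"X", "Y", "A", "B"},
    "undirected": {frozenset(p) for p in
                   [("X", "Y"), ("Y", "A"), ("Y", "B"), ("X", "A"), ("X", "B")]},
    "directed": set(),
}
```

Here NA_{Y,X} = {A, B}; choosing H as the empty set is invalid because the complement {A, B} is not a clique, while H = {A} or H = {A, B} satisfies the preconditions.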

4 SELECTIVE GREEDY EQUIVALENCE 
SEARCH 

In this section, we define a variant of the GES algorithm called
selective GES, or SGES for short, that uses a subset of the GES
operators. The subset is chosen based on a given property Π that
is known to hold for the generative structure G. Just like GES,
SGES (shown in Figure 3) has a forward phase and a backward phase.

For the forward phase of SGES, it suffices for our theoretical
analysis that we use a method that returns an IMAP of G (in the
large-sample limit) using only a polynomial number of
insert-operator score calls. For this reason, we call this phase
poly-FES. A simple implementation of poly-FES is to return the
no-independence CPDAG (with no score calls), but other
implementations are likely more useful in practice.

The backward phase of SGES, which we call selective backward
equivalence search (SBES), uses only a subset of the BES delete
operators. This subset must necessarily include all Π-consistent
delete operators (defined below) in order to maintain the
large-sample consistency of GES, but the subset can (and will)
include additional operators for the sake of efficient
enumeration.

The DAG properties used by SGES must be equivalence invariant,
meaning that for any pair of equivalent DAGs, either the property
holds for both of them or it holds for neither of them. Thus, for
any equivalence-invariant DAG property Π, it makes sense to say
that Π either holds or does not hold for a PDAG. As shown by
Chickering (1995), a DAG property is equivalence invariant if and
only if it is invariant to covered-edge reversals; it follows that
the property that each node has at most k parents is equivalence
invariant, whereas the property that the length of the longest
directed path is at least k is not. Furthermore, the properties
for SGES must also be hereditary, which means that if Π holds for
a PDAG P it must also hold for all induced subgraphs of P. For
example, the max-parent property is hereditary, whereas the
property that each node has at least k parents is not. We use EIH
property to refer to a property that is equivalence invariant and
hereditary.
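
Both requirements can be spot-checked for the max-parent property on a small example. The sketch below assumes the parent-set dictionary representation of a DAG; the helper names and the four-node graph are hypothetical, and the check is an illustration on one graph rather than a proof.

```python
from itertools import combinations

def max_parents_ok(parents, k):
    """The max-parent property: every node has at most k parents."""
    return all(len(ps) <= k for ps in parents.values())

def induced(parents, keep):
    """Subgraph induced by the node set `keep`."""
    return {v: ps & keep for v, ps in parents.items() if v in keep}

def reverse_covered(parents, x, y):
    """Return a copy of the DAG with the covered edge x -> y reversed."""
    if not (x in parents[y] and parents[y] - {x} == parents[x]):
        raise ValueError("edge is not covered")
    out = {v: set(ps) for v, ps in parents.items()}
    out[y].discard(x)
    out[x].add(y)
    return out

dag = {"A": set(), "B": {"A"}, "C": {"A", "B"}, "D": {"C"}}
nodes = sorted(dag)

# Hereditary: the bound k = 2 survives in every induced subgraph.
hereditary = all(
    max_parents_ok(induced(dag, set(sub)), 2)
    for r in range(len(nodes) + 1)
    for sub in combinations(nodes, r)
)

# Covered-edge reversal only swaps the parent counts of the two endpoints.
flipped = reverse_covered(dag, "B", "C")
```

Reversing a covered edge X → Y leaves Y with the old parents of X and gives X the old parents of Y (with Y in place of X), so the collection of parent-set sizes is unchanged.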

Definition 1. Π-Consistent GES Delete

A GES delete operator Delete(X, Y, H) is Π consistent for CPDAG C
if, for the set of common descendants W of X and Y in the
resulting CPDAG C', the property holds for the induced subgraph
C'[X ∪ Y ∪ W].

In other words, after the delete, the property holds for the
subgraph defined by X, Y, and their common descendants.
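
The check in Definition 1 can be sketched as follows. As a simplifying assumption, the code works on a single DAG member of the post-delete equivalence class (stored as a parent-set dictionary) instead of on the CPDAG C' itself; the function names and the example graph are hypothetical.

```python
def descendants(parents, node):
    """All nodes reachable from `node` along directed edges."""
    children = {v: set() for v in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    seen, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def holds_on_common_descendants(parents, x, y, prop):
    """Evaluate `prop` on the subgraph induced by x, y and their common descendants."""
    keep = {x, y} | (descendants(parents, x) & descendants(parents, y))
    sub = {v: parents[v] & keep for v in keep}
    return prop(sub)

def at_most(k):
    return lambda sub: all(len(ps) <= k for ps in sub.values())

# Hypothetical post-delete DAG: X -> Z <- Y, U -> Z, Z -> W (no X-Y edge).
after_delete = {"X": set(), "Y": set(), "U": set(),
                "Z": {"X", "Y", "U"}, "W": {"Z"}}
```

In the example, U is not a common descendant of X and Y, so it is excluded from the induced subgraph; Z then has two parents there even though it has three in the full graph, which shows that the test is local to X, Y, and their common descendants.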


Algorithm SGES(D, Π)

Input : Data D, Property Π
Output: CPDAG C

C ← poly-FES
C ← SBES(D, C, Π)

return C


Figure 3: Pseudo-code for the SGES algorithm.


Algorithm SBES(D, C, Π)

Input : Data D, CPDAG C, Property Π
Output: CPDAG

Repeat

    Ops ← Generate Π-consistent delete operators for C
    Op ← highest-scoring operator in Ops
    if score of Op is negative then return C
    C ← Apply Op to C


Figure 4: Pseudo-code for the SBES algorithm.
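
The control flow of Figure 4 is a plain greedy loop. The sketch below keeps the operator generation, scoring, and application as caller-supplied functions, so it says nothing about how those pieces are implemented; the early return on an empty operator set and the toy edge-set "state" are assumptions added for the illustration.

```python
def sbes(state, generate_ops, score_op, apply_op):
    """Greedy backward loop following the shape of Figure 4."""
    while True:
        ops = generate_ops(state)
        if not ops:                      # assumption: nothing left to delete
            return state
        best = max(ops, key=lambda op: score_op(state, op))
        if score_op(state, best) < 0:    # stop when the best delete hurts
            return state
        state = apply_op(state, best)

# Toy stand-in: a state is a set of edges, and an oracle score rewards
# deleting any edge that is absent from a hypothetical "true" edge set.
true_edges = frozenset({("A", "B")})
start = frozenset({("A", "B"), ("A", "C"), ("B", "C")})

result = sbes(
    start,
    generate_ops=lambda state: sorted(state),
    score_op=lambda state, op: -1.0 if op in true_edges else 1.0,
    apply_op=lambda state, op: state - {op},
)
```

The toy run removes the two spurious edges and stops once the only remaining delete has a negative score.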

4.1 LARGE-SAMPLE CORRECTNESS 

The following theorem establishes a graph-theoretic justification
for considering only the Π-consistent deletions at each step of
SBES.

Theorem 2. If G < C for CPDAG C and DAG G, then for any EIH
property Π that holds on G, there exists a Π-consistent
Delete(X, Y, H) that when applied to C results in the CPDAG C' for
which G ≤ C'.

We postpone the proof of Theorem 2 to the appendix. The result is
a consequence of an explicit characterization of, for a given pair
of DAGs G and H such that G < H, an edge in H that we can either
reverse or delete in H such that for the resulting DAG H', we have
G ≤ H'.³

Theorem 3. Let C be the CPDAG that results from applying the SGES
algorithm to (1) m records sampled from a distribution that is
perfect with respect to DAG G and (2) EIH property Π that holds on
G. Then in the limit of large m, C ~ G.

Proof: Because the scoring function is locally consistent, we know
poly-FES must return an IMAP of G. Because SBES includes all the
Π-consistent delete operators, Theorem 2 guarantees that, unless
C ~ G, there will be a positive-scoring operator. □

4.2 COMPLEXITY MEASURES 

In this section, we discuss a number of distributional assumptions
that we can use with Theorem 2 to limit the number of operators
that SGES needs to score. As discussed in Section 2, when we
assume the generative distribution is perfect with respect to a
DAG G, then graph-theoretic assumptions about G can lead to more
efficient training algorithms. Common assumptions used include (1)
a maximum parent-set size for any node, (2) a maximum-clique⁴ size
among any nodes and (3) a maximum treewidth. Treewidth is
important because the complexity of exact inference is exponential
in this measure.

We can associate a property with each of these assumptions that
holds precisely when the DAG G satisfies that assumption. Consider
the constraint that the maximum number of parents for any node in
G is some constant k. Then, using “PS” to denote parent size, we
can define the property Π_PS to be true precisely for those DAGs
in which each node has at most k parents. Similarly we can define
Π_CL and Π_TW to correspond to maximum-clique size and maximum
treewidth, respectively.

For two properties Π and Π', we write Π ⊆ Π' if for every DAG G
for which Π holds, Π' also holds. In other words, Π is a more
constraining property than is Π'. Because the lowest node in any
clique has all other nodes in the clique as parents, it is easy to
see that Π_PS ⊆ Π_CL. Because the treewidth for DAG G is defined
to be the size of the largest clique minus one in a graph whose
cliques are at least as large as those in G, we also have
Π_TW ⊆ Π_CL. Which property to use will typically be a trade-off
between how reasonable the assumption is (i.e., less constraining
properties are more reasonable) and the efficiency of the
resulting algorithm (i.e., more constraining properties lead to
faster algorithms).

³Chickering (2002) characterizes the reverse transformation of
reversals/additions in G, which provides an implicit
characterization of reversals/deletions in H.

⁴We use clique in a DAG to mean a set of nodes in which all pairs
are adjacent.

We now consider a new complexity measure called v-width, whose
corresponding property is less constraining than the previous
three, and somewhat remarkably leads to an efficient
implementation in SGES. For a DAG G, the v-width is defined to be
the maximum of, over all pairs of non-adjacent nodes X and Y, the
size of the largest clique among common children of X and Y. In
other words, v-width is similar to the maximum-clique-size bound,
except that the bound only applies to cliques of nodes that are
shared children of some pair of non-adjacent nodes. With this
understanding it is easy to see that, for the property Π_VW
corresponding to a bound on the v-width, we have Π_CL ⊆ Π_VW.

To illustrate the difference between v-width and the other
complexity measures, consider the two DAGs in Figure 5. The DAG in
Figure 5(a) has a clique of size K, and consequently a
maximum-clique size of K and a maximum parent-set size of K − 1.
Thus, if K is O(n) for a large graph of n nodes, any algorithm
that is exponential in these measures will not be efficient. The
v-width, however, is zero for this DAG. The DAG in Figure 5(b), on
the other hand, has a v-width of K.
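
The definition of v-width can be evaluated directly on small graphs. The sketch below is a brute-force computation (exponential in the number of common children, so suitable only for toy inputs), assuming the parent-set dictionary representation; the two example DAGs are hypothetical stand-ins chosen to show the contrast and are not reproductions of Figure 5.

```python
from itertools import combinations

def v_width(parents):
    """Max, over non-adjacent pairs, of the largest clique among common children."""
    nodes = sorted(parents)

    def adj(a, b):
        return a in parents[b] or b in parents[a]

    def largest_clique(cands):
        for r in range(len(cands), 0, -1):
            for sub in combinations(cands, r):
                if all(adj(a, b) for a, b in combinations(sub, 2)):
                    return r
        return 0

    width = 0
    for x, y in combinations(nodes, 2):
        if not adj(x, y):
            common = [c for c in nodes if x in parents[c] and y in parents[c]]
            width = max(width, largest_clique(common))
    return width

# A complete DAG on four nodes: one clique of size 4, no non-adjacent pair.
complete = {0: set(), 1: {0}, 2: {0, 1}, 3: {0, 1, 2}}

# Non-adjacent X and Y sharing a clique of three children.
shared = {"X": set(), "Y": set(),
          "C1": {"X", "Y"}, "C2": {"X", "Y", "C1"}, "C3": {"X", "Y", "C1", "C2"}}
```

The complete DAG has a large clique but v-width zero, while the second graph has v-width three because the whole clique sits below a non-adjacent pair.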



Figure 5: Two DAGs (a) and (b) having identical maximum clique
sizes, similar maximum number of parents, and divergent v-widths.


In order to use a property with SGES, we need to establish that it
is EIH. For Π_PS, Π_CL and Π_VW, equivalence-invariance follows
from the fact that all three properties are covered-edge
invariant, and hereditary follows because the corresponding
measures cannot increase when we remove nodes and edges from a
DAG. Although we can establish EIH for the treewidth property Π_TW
with more work, we omit further consideration of treewidth for the
sake of space.


4.3 GENERATING DELETIONS 

In this section, we show how to generate a set of deletion
operators for SBES such that all Π-consistent deletion operators
are included, for any Π ∈ {Π_PS, Π_CL, Π_VW}. Furthermore, the
total number of deletion operators we generate is polynomial in
the number of nodes in the domain and exponential in k.

Our approach is to restrict the Delete(X, Y, H) operators based on
the H sets and the resulting CPDAG C'. In particular, we rule out
candidate H sets for which Π does not hold on the induced subgraph
C'[H ∪ X ∪ Y]; because all nodes in H will be common children of X
and Y in C' (and thus a subset of the common descendants of X and
Y) we know from Definition 1 (and the fact that Π is hereditary)
that none of the dropped operators can be Π-consistent.

Before presenting our restricted-enumeration algorithm, we now
discuss how to enumerate delete operators without restrictions. As
shown by Andersson et al. (1997), a CPDAG is a chain graph whose
undirected components are chordal. This means that the induced
subgraph defined over NA_{Y,X}, which is a subset of the neighbors
of Y, is an undirected chordal graph. A useful property of chordal
graphs is that we can identify, in polynomial time, a set of
maximal cliques over these nodes⁵; let C_1, ..., C_m denote the
nodes contained within these m maximal cliques, and let
H̄ = NA_{Y,X} \ H be the complement of the shared neighbors with
respect to the candidate H. Recall from Figure 2 that the
preconditions for any Delete(X, Y, H) include the requirement that
H̄ is a clique. This means that for any valid H, there must be
some maximal clique C_i that contains the entirety of H̄; thus, we
can generate all operators (without regard to any property) by
stepping through each maximal clique C_i in turn, initializing H
to be all nodes not in C_i, and then generating a new operator
corresponding to expanding H by all subsets of nodes in C_i. Note
that if NA_{Y,X} is itself a clique, we are enumerating over all
operators.

As we show below, all three of the properties of interest impose a
bound on the maximum clique size among nodes in H. If we are given
such a bound s, we know that any “expansion” subset for a clique
that has size greater than s will result in an operator that is
not valid. Thus, we can implement the above operator-enumeration
approach more efficiently by only generating subsets within each
clique that have size at most s. This allows us to process each
clique C_i with only O((|C_i| + 1)^s) calls to the scoring
function. In addition, we need not enumerate over any of the
subsets of C_i if, after removing this clique from the graph,
there remains a clique of size greater than s; we define the
function FilterCliques({C_1, ..., C_m}, s) to be the subset of
cliques that remain after imposing this constraint. With this
function, we can define Selective-Generate-Ops as shown in
Figure 6 to leverage the max-clique-size constraint when
generating operators; this algorithm will in turn be used to
generate all of the CPDAG operators during SBES.

⁵Blair and Peyton (1993) provide a good survey on chordal graphs
and detail how to identify the maximal cliques while running
maximum-cardinality search.


Algorithm Selective-Generate-Ops(C, X, Y, s)

Input : CPDAG C with adjacent X, Y and limit s
Output: Ops = {H_1, ..., H_m}

Ops ← ∅
Generate maximal cliques C_1, ..., C_m from NA_{Y,X}
S ← FilterCliques({C_1, ..., C_m}, s)
foreach C_i ∈ S do
    H_0 ← NA_{Y,X} \ C_i
    foreach C ⊆ C_i with |C| ≤ s do
        Add H_0 ∪ C to Ops
    end
end
return Ops


Figure 6: Algorithm to generate clique-size limited delete
operators.

Example: In Figure 7 we show an example CPDAG for which to run
Selective-Generate-Ops(C, X, Y, s) for various values of s. In the
example, there is a single clique C = {A, B} in the set NA_{Y,X},
and thus at the top of the outer foreach loop, the set H_0 is
initialized to the empty set. If s = 0, the only subset of C with
size zero is the empty set, and so that is added to Ops and the
algorithm returns. If s = 1 we add, in addition to the empty set,
all singleton subsets of C. For s ≥ 2, we add all subsets of C. □
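
The enumeration in Figure 6 can be sketched in a few lines and checked against the example above. The sketch assumes the maximal cliques of NA_{Y,X} are supplied by the caller (the paper obtains them from the chordal structure; that step is not reproduced here), and it reads the FilterCliques condition as "no other maximal clique leaves more than s nodes outside C_i", which is an interpretation of the text and not a definition taken from it.

```python
from itertools import combinations

def filter_cliques(cliques, s):
    """Keep C_i only if every clique outside C_i has size at most s."""
    return [ci for ci in cliques if all(len(cj - ci) <= s for cj in cliques)]

def selective_generate_ops(na_yx, cliques, s):
    """Candidate H sets: H_0 = NA \\ C_i expanded by subsets of C_i of size <= s."""
    ops = set()
    for ci in filter_cliques(cliques, s):
        h0 = frozenset(na_yx) - ci
        for r in range(min(s, len(ci)) + 1):
            for c in combinations(sorted(ci), r):
                ops.add(h0 | frozenset(c))
    return ops

# The example from the text: NA_{Y,X} = {A, B} forms a single clique.
na_yx = {"A", "B"}
cliques = [frozenset({"A", "B"})]
ops_by_s = {s: selective_generate_ops(na_yx, cliques, s) for s in range(4)}
```

For s = 0 only the empty set is produced, for s = 1 the two singletons are added, and for s ≥ 2 all four subsets appear, matching the example.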

Now we discuss how each of the three properties imposes a
constraint s on the maximum clique among nodes in H, and
consequently the selective-generation algorithm in Figure 6 can be
used with each one, given an appropriate bound s. For both Π_VW
and Π_CL, the k given imposes an explicit bound on s (i.e., s = k
for both). Because any clique in H of size r will result in a DAG
member of the resulting equivalence class having a node in that
clique with at least r + 1 parents (i.e., r − 1 from the other
nodes in the clique, plus both X and Y), we have for Π_PS,
s = k − 1.

We summarize the discussion above in the following 
proposition. 

Proposition 1. Algorithm Selective-Generate-Ops applied to all
edges using clique-size bound s generates all Π-consistent delete
operators for Π ∈ {Π_PS, Π_CL, Π_VW}.


We now argue that running SBES on a domain of $n$ variables
when using Algorithm Selective-Generate-Ops
with a bound $s$ requires only a polynomial number in $n$ of
calls to the scoring function. Each clique in the inner loop
of the algorithm can contain at most $n$ nodes, and therefore
we generate and score at most $(n+1)^s$ operators, requiring
at most $2(n+1)^s$ calls to the scoring function. Because
the cliques are maximal, there can be at most $n$ of them
considered in the outer loop. Because there are never more
than $n^2$ edges in a CPDAG, and we will delete at most all
of them, we conclude that even if we decided to rescore every
operator after every edge deletion, we will only make a
polynomial number of calls to the scoring function.
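
The counting argument above can be checked numerically. The sketch below is an illustration only: `subsets_at_most` and `score_call_bound` are hypothetical helper names, and the product in `score_call_bound` is one crude way of multiplying out the factors named in the paragraph (deletions, edges rescored per deletion, cliques per edge, calls per clique), not a bound stated in this form by the paper.

```python
from math import comb


def subsets_at_most(n, s):
    """Number of subsets with at most s elements drawn from an n-node clique."""
    return sum(comb(n, i) for i in range(min(n, s) + 1))


def score_call_bound(n, s):
    """Crude upper bound on scoring calls, multiplying the factors in the text."""
    calls_per_clique = 2 * (n + 1) ** s   # two calls per generated operator
    cliques_per_edge = n                  # at most n maximal cliques
    edges = n ** 2                        # never more than n^2 edges in a CPDAG
    deletions = n ** 2                    # we delete at most all of the edges
    return deletions * edges * cliques_per_edge * calls_per_clique


# (n + 1)^s dominates the true number of size-limited subsets.
bound_holds = all(
    subsets_at_most(n, s) <= (n + 1) ** s
    for n in range(1, 9)
    for s in range(0, 6)
)
```

For fixed s the product is a polynomial in n (here of degree s + 5), while s appears only in the exponent.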

From the above discussion and the fact that SEES completes
using at most a polynomial number of calls to the
scoring function, we get the following result for the full
SGES algorithm.

Proposition 2. The SGES algorithm, when run over a domain
of $n$ variables and given $\Pi \in \{\Pi^{s+1}_{PS}, \Pi^{s}_{CL}, \Pi^{s}_{VW}\}$,
runs to completion using a number of calls to the DAG
scoring function that is polynomial in $n$ and exponential
in $s$.



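[Figure 7 image: example CPDAG with the operators generated for each bound.
$s = 0$: Ops = {{}}; $s = 1$: Ops = {{}, {A}, {B}}; $s \geq 2$: Ops = {{}, {A}, {B}, {A, B}}]

Figure 7: An example CPDAG $\mathcal{C}$ and the resulting operators
generated by Selective-Generate-Ops($\mathcal{C}$, $X$, $Y$, $s$)
for various values of $s$.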


5 EXPERIMENTS 

In this section, we present a simple synthetic experiment
comparing SBES and BES that demonstrates the value of
pruning operators. In our experiment we used an oracle
scoring function. In particular, given a generative model Q, 
our scoring function computes the minimum-description- 
length score assuming a data size of five billion records, 
but without actually sampling any data: instead, we use 
exact inference in Q (i.e., instead of counting from data) 
to compute the conditional probabilities needed to compute 
the expected log loss. This allows us to get near-asymptotic 
behavior without the need to sample data. To evaluate the 
cost of running each algorithm, we counted the number of 
times the scoring function was called on a unique node and 
parent-set combination; we cached these scores away so 
that if they were needed multiple times during a run of the 
algorithm, they were only computed (and counted) once. 
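
The counting scheme described above can be sketched as a small cache keyed on the node and its parent set. This is a minimal illustration, not the authors' implementation: the class name `CountingScoreCache` and the stand-in function `toy_local_score` are assumptions made for the example, and a real run would plug in the oracle MDL score.

```python
class CountingScoreCache:
    """Cache local scores so each (node, parent set) pair is computed once."""

    def __init__(self, local_score):
        self._local_score = local_score
        self._cache = {}

    def score(self, node, parents):
        # The parent set is unordered, so a frozenset is used in the key.
        key = (node, frozenset(parents))
        if key not in self._cache:
            self._cache[key] = self._local_score(node, key[1])
        return self._cache[key]

    @property
    def evaluations(self):
        """Number of unique node and parent-set combinations scored so far."""
        return len(self._cache)


def toy_local_score(node, parents):
    # Hypothetical stand-in for the oracle score used in the experiment.
    return -float(len(parents))


cache = CountingScoreCache(toy_local_score)
cache.score("A", ["B", "C"])
cache.score("A", ["C", "B"])   # same parent set, served from the cache
cache.score("B", [])
```

Reading `cache.evaluations` at the end of a search run gives the cost measure used in the experiment, since repeated requests for the same combination do not add to it.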









In Figure 8 we show the average number of scoring-function
calls required to complete BES and SBES when
starting from a complete graph over a domain of $n$ binary
variables, for varying values of $n$. Each average is
taken over ten trials, corresponding to ten random generative
models. All variables in the domain were binary. We
generated the structure of each generative model as follows.
First, we enumerated all node pairs by randomly permuting
the nodes and taking each node in turn with all of its
predecessors in turn. For each node pair in turn, we chose to
attempt an edge insertion with probability one half. For
each attempt, we added an edge if doing so (1) did not create
a cycle and (2) did not result in a node having more
than two parents; if an edge could be added in either
direction, we chose the direction at random. We sampled
the conditional distributions for each node and each parent
configuration from a uniform Dirichlet distribution with
equivalent-sample size of one. We ran SBES with $\Pi^{2}_{PS}$.
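
The structure-generation procedure just described can be sketched as follows. This is one reading of the prose under stated assumptions, not the code used for the experiment: the function names, the parent-set dictionary representation, and the `rng` parameter are all illustrative, and the sampling of conditional distributions is left out.

```python
import random


def creates_cycle(parents, src, dst):
    """True if adding src -> dst would close a directed cycle."""
    # A cycle appears exactly when dst is already an ancestor of src.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def random_structure(n, max_parents=2, p_attempt=0.5, rng=None):
    """Return {node: set of parents} for a random DAG over nodes 0..n-1."""
    rng = rng or random.Random()
    order = list(range(n))
    rng.shuffle(order)
    parents = {v: set() for v in order}
    for i, node in enumerate(order):
        for other in order[:i]:              # each predecessor in turn
            if rng.random() >= p_attempt:    # attempt with probability 1/2
                continue
            allowed = [
                (a, b)
                for a, b in ((node, other), (other, node))
                if len(parents[b]) < max_parents
                and not creates_cycle(parents, a, b)
            ]
            if allowed:                      # pick a legal direction at random
                a, b = rng.choice(allowed)
                parents[b].add(a)
    return parents


example_graph = random_structure(12, rng=random.Random(0))
```

Every graph produced this way is acyclic with at most two parents per node, so the parent-count property used for SBES holds by construction.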





Figure 8: Number of score evaluations needed to run BES
and SBES, starting from the complete graph, for a range of
domain sizes.

Our results show clearly the exponential dependence of
BES on the number of nodes in the clique, and the increasing
savings we get with SBES, leveraging the fact that $\Pi^{2}_{PS}$
holds in the generative structure.

Note that to realize large savings in practice, when GES
runs FES instead of starting from a dense graph, a (relatively
sparse) generative distribution must lead FES to an
equivalence class containing a (relatively dense) undirected
clique that is subsequently “thinned” during BES. We can
synthesize challenging grid distributions to force FES into
such states, but it is not clear how realistic such distributions
are in practice. When we re-run the clique experiment
above, but instead start both BES and SBES
from the model that results from running FES (i.e., with
no polynomial-time guarantee), the savings from SBES are
small due to the fact that the subsequent equivalence classes
do not contain large cliques.

6 CONCLUSION 

Through our selective greedy equivalence search algorithm
SGES, we have demonstrated how to leverage
graph-theoretic properties to reduce the need to score
graphs during score-based search over equivalence classes
of Bayesian networks. Furthermore, we have shown
that for graph-theoretic complexity properties including
maximum-clique size, maximum number of parents, and
v-width, we can guarantee that the number of score evaluations
is polynomial in the number of nodes and exponential
in these complexity measures.

The fact that we can use our approach to selectively
choose operators for any hereditary and equivalence-invariant
graph-theoretic property provides the opportunity
to explore alternative complexity measures. Another candidate
complexity measure is the maximum number of
v-structures. Although the corresponding property does not
limit the maximum size of a clique in $\mathbf{H}$, it limits directly
the size $|\mathbf{H}|$ for every operator. Thus it would be easy to
enumerate these operators efficiently. Another complexity
measure of interest is treewidth, due to the fact that exact
inference in a Bayesian-network model takes time exponential
in this measure.

The results we have presented are for the general
Bayesian-network learning problem. It is interesting to consider the
implications of our results for the problem of learning particular
subsets of Bayesian networks. One natural class that
we discussed earlier is that of polytrees. If we assume
that the generative distribution is perfect with respect to a
polytree then we know the v-width of the generative graph
is one. This implies, in the limit of large data, that we can
recover the structure of the generative graph with a polynomial
number of score evaluations. This provides a score-based
recovery algorithm analogous to the constraint-based
approach of Geiger et al. (1990).

We presented a simple complexity analysis for the purpose
of demonstrating that SGES uses only a polynomial number
of calls to the scoring function. We leave as future work
a more careful analysis that establishes useful constants in
this polynomial. In particular, we can derive tighter bounds
on the total number of node-and-parent-configurations that
are needed to score all the operators for each CPDAG, and
by caching these configuration scores we can further take
advantage of the fact that most operators remain valid (i.e.,
the preconditions still hold) and have the same score after
each transformation.

Finally, we plan to investigate practical implementations of
poly-FES that have the polynomial-time guarantees needed
for SGES.









Appendices 

In the following two appendices, we prove Theorem 2.

A Additional Background 

In this section, we introduce additional background material
needed for the proofs.

A.1 Additional Notation

To express sets of variables more compactly, we often use
a comma to denote set union (e.g., we write $\mathbf{X} = \mathbf{Y}, \mathbf{Z}$ as a
more compact version of $\mathbf{X} = \mathbf{Y} \cup \mathbf{Z}$). We also will sometimes
remove the comma (e.g., $\mathbf{YZ}$). When a set consists
of a singleton variable, we often use the variable name as
shorthand for the set containing that variable (e.g., we write
$\mathbf{X} = \mathbf{Y} \setminus Z$ as shorthand for $\mathbf{X} = \mathbf{Y} \setminus \{Z\}$).

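We say a node $N$ is a descendant of $Y$ if $N = Y$ or there is
a directed path from $Y$ to $N$. We use $\mathcal{H}$-descendant to refer
to a descendant in a particular DAG $\mathcal{H}$. We say a node $N$
is a proper descendant of $Y$ if $N$ is a descendant of $Y$ and
$N \neq Y$. We use $\mathbf{NonDe}^{\mathcal{H}}_{Y}$ to denote the non-descendants
of node $Y$ in $\mathcal{H}$. We use $\mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X_1 \ldots X_n}$ as shorthand for
$\mathbf{Pa}^{\mathcal{H}}_{Y} \setminus \{X_1, \ldots, X_n\}$. For example, to denote all the parents
of $Y$ in $\mathcal{H}$ except for $X$ and $Z$, we use $\mathbf{Pa}^{\mathcal{H}}_{Y \downarrow XZ}$.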

A.2 D-separation and Active Paths

The independence constraints implied by a DAG structure
are characterized by the d-separation criterion. Two nodes
$A$ and $B$ are said to be d-separated in a DAG $\mathcal{G}$ given a
set of nodes $\mathbf{S}$ if and only if there is no active path in $\mathcal{G}$
between $A$ and $B$ given $\mathbf{S}$. The standard definition of an
active path is a simple path for which each node $W$ along
the path either (1) has converging arrows (i.e., $\rightarrow W \leftarrow$)
and $W$ or a descendant of $W$ is in $\mathbf{S}$ or (2) does not have
converging arrows and $W$ is not in $\mathbf{S}$. By simple, we mean
that the path never passes through the same node twice.

To simplify our proofs, we use an equivalent definition of
an active path (one that need not be simple) where each node
$W$ along the path either (1) has converging arrows and $W$ is
in $\mathbf{S}$ or (2) does not have converging arrows and $W$ is not in
$\mathbf{S}$. In other words, instead of allowing a segment $\rightarrow W \leftarrow$
to be included in a path by virtue of a descendant of $W$
belonging to $\mathbf{S}$, we require that the path include the sequence
of edges from $W$ to that descendant and then back again.
For those readers familiar with the celebrated “Bayes ball”
algorithm of Shachter (1998) for testing d-separation, our
expanded definition of an active path is simply a valid path
that the ball can take between $A$ and $B$.
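
As an illustration of this reachability view, the following sketch tests d-separation by exploring (node, direction) states, in the spirit of Bayes ball. It is a standard textbook-style procedure written under stated assumptions rather than code from the paper: the function name `d_separated` and the parent-set dictionary representation are illustrative, the two endpoints are assumed not to be in the conditioning set, and the set of ancestors of the conditioning set is used as a shortcut for walking down to a conditioned descendant and back.

```python
def d_separated(parents, a, b, given):
    """True if a and b are d-separated given `given` in the DAG `parents`.

    `parents` maps every node to the set of its parents.
    """
    given = set(given)
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # Nodes in `given` together with all of their ancestors.
    ancestors, stack = set(), list(given)
    while stack:
        v = stack.pop()
        if v not in ancestors:
            ancestors.add(v)
            stack.extend(parents[v])

    # "up": the ball arrived from a child; "down": it arrived from a parent.
    visited, stack = set(), [(a, "up")]
    while stack:
        v, direction = stack.pop()
        if (v, direction) in visited:
            continue
        visited.add((v, direction))
        if v == b and v not in given:
            return False                      # an active path reaches b
        if direction == "up" and v not in given:
            stack.extend((p, "up") for p in parents[v])
            stack.extend((c, "down") for c in children[v])
        elif direction == "down":
            if v not in given:                # pass through a non-collider
                stack.extend((c, "down") for c in children[v])
            if v in ancestors:                # collider opened by `given`
                stack.extend((p, "up") for p in parents[v])
    return True


# Collider X -> Z <- Y with a further descendant Z -> W.
collider = {"X": set(), "Y": set(), "Z": {"X", "Y"}, "W": {"Z"}}
```

In the collider example, X and Y are d-separated given the empty set, but conditioning on Z, or on its descendant W, activates the path between them.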

We use $\mathbf{X} \perp\!\!\!\perp_{\mathcal{G}} \mathbf{Y} \mid \mathbf{Z}$ to denote the assertion that DAG $\mathcal{G}$
imposes the constraint that variables $\mathbf{X}$ are independent of
variables $\mathbf{Y}$ given variables $\mathbf{Z}$. When a node $W$ along a path
has converging arrows, we say that $W$ is a collider at that
position in the path.

The direction of each terminal edge in an active path (that
is, the first and last edge encountered in a traversal from one
end of the path to the other) is important for determining
whether we can append two active paths together to make
a third active path. We say that a path $\pi(A, B)$ is into $A$ if
the terminal edge incident to $A$ is oriented toward $A$ (i.e.,
$A \leftarrow$). Similarly, the path is into $B$ if the terminal edge
incident to $B$ is oriented toward $B$. If a path is not into
an endpoint $A$, we say that the path is out of $A$. Using the
following result from Chickering (2002), we can combine
active paths together.

Lemma 1. (Chickering, 2002) Let $\pi(A, B)$ be an $\mathbf{S}$-active
path between $A$ and $B$, and let $\pi(B, C)$ be an $\mathbf{S}$-active
path between $B$ and $C$. If either path is out of $B$, then the
concatenation of $\pi(A, B)$ and $\pi(B, C)$ is an $\mathbf{S}$-active path
between $A$ and $C$.

Given a DAG $\mathcal{H}$ that is an IMAP of DAG $\mathcal{G}$, we use the
d-separation criterion in two general ways in our proofs.
First, we identify d-separation facts that hold in $\mathcal{H}$ and
conclude that they must also hold in $\mathcal{G}$. Second, we identify
active paths in $\mathcal{G}$ and conclude that there must be
corresponding active paths in $\mathcal{H}$.

A.3 Independence Axioms 

In many of our proofs, we would like to reason about the
independence facts that hold in DAG $\mathcal{G}$ without knowing
what its structure is, which makes using the d-separation
criterion problematic. As described in Pearl (1988), any
set of independence facts characterized by the d-separation
criterion also respects the independence axioms shown in
Figure 9. These axioms allow us to take a set of independence
facts in some unknown $\mathcal{G}$ (e.g., that are implied by
d-separation in $\mathcal{H}$), and derive new independence facts that
we know must also hold in $\mathcal{G}$.

Throughout the proofs, we will often use the Symmetry
axiom implicitly. For example, if we have $A \perp\!\!\!\perp B, C \mid D$ we
might claim that $B \perp\!\!\!\perp A \mid C, D$ follows from Weak Union,
as opposed to concluding $A \perp\!\!\!\perp B \mid C, D$ from Weak Union
and then applying Symmetry. We will frequently identify
independence constraints in $\mathcal{H}$ and conclude that they
hold in $\mathcal{G}$, without explicitly justifying this with “because
$\mathcal{G} \leq \mathcal{H}$”. For example, we will say:

Because $A$ is a non-descendant of $B$ in $\mathcal{H}$, it follows from
the Markov conditions that $A \perp\!\!\!\perp_{\mathcal{G}} B \mid \mathbf{Pa}^{\mathcal{H}}_{B}$.

In other words, to be explicit we would say that
$A \perp\!\!\!\perp_{\mathcal{H}} B \mid \mathbf{Pa}^{\mathcal{H}}_{B}$ follows from the Markov conditions, and
the independence holds in $\mathcal{G}$ because $\mathcal{G} \leq \mathcal{H}$.


Symmetry: 

Decomposition: 

Composition: 

Intersection: 

Weak Union: 
Contraction: 

Weak Transitivity: 


X_LLY|Z 
X_LLY,wiz 
X_LLY|Z + XHwjz 
X_LLY|Z,W + X_LLW|Z,Y 
X_LLY,W|Z 
X_LLW|Z,Y + XIIYIZ 
X_LLY|Z + X_LLY|Z,T 



Y_LLX|Z 

XIIYIZ + X I I W|Z 

XI1Y,W|Z 

XIlY,wiz 

XI1Y|Z,W 

XI1Y,W|Z 

XI1T|Z OR YIir|Z 


Figure 9: The DAG-perfect independence axioms. 


The Composition axiom states that if $\mathbf{X}$ is independent of
both $\mathbf{Y}$ and $\mathbf{W}$ individually given $\mathbf{Z}$, then $\mathbf{X}$ is independent
of them jointly. If we have more than two such sets that are
independent of $\mathbf{X}$, we can apply the Composition axiom
repeatedly to combine them all together. To simplify, we will
do this combination implicitly, and assume that the Composition
axiom is defined more generally. Thus, for example,
we might have:

Because $\mathbf{X} \perp\!\!\!\perp Y \mid \mathbf{Z}$ for every $Y \in \mathbf{Y}$, we conclude by the
Composition axiom that $\mathbf{X} \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{Z}$.
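
As a small worked instance of this implicit use, consider a three-element set $\mathbf{Y} = \{Y_1, Y_2, Y_3\}$ (chosen here purely for illustration). Two applications of the Composition axiom give the joint statement:

```latex
\begin{align*}
\mathbf{X} \perp\!\!\!\perp Y_1 \mid \mathbf{Z}
  \;+\; \mathbf{X} \perp\!\!\!\perp Y_2 \mid \mathbf{Z}
  &\;\Longrightarrow\;
  \mathbf{X} \perp\!\!\!\perp Y_1, Y_2 \mid \mathbf{Z}
  && \text{(Composition)} \\
\mathbf{X} \perp\!\!\!\perp Y_1, Y_2 \mid \mathbf{Z}
  \;+\; \mathbf{X} \perp\!\!\!\perp Y_3 \mid \mathbf{Z}
  &\;\Longrightarrow\;
  \mathbf{X} \perp\!\!\!\perp Y_1, Y_2, Y_3 \mid \mathbf{Z}
  && \text{(Composition)}
\end{align*}
```

The same pattern extends to any finite set, adding one element per application.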

B Proofs 

In this section, we provide a number of intermediate results
that lead to a proof of Theorem 2.

B.1 Intermediate Result: “The Deletion Lemma”

Given DAGs $\mathcal{G}$ and $\mathcal{H}$ for which $\mathcal{G} \leq \mathcal{H}$, we say that an
edge $e$ from $\mathcal{H}$ is deletable in $\mathcal{H}$ with respect to $\mathcal{G}$ if, for
the DAG $\mathcal{H}'$ that results after removing $e$ from $\mathcal{H}$, we have
$\mathcal{G} \leq \mathcal{H}'$. We will say that an edge is deletable in $\mathcal{H}$ or
simply deletable if $\mathcal{G}$ or both DAGs, respectively, are clear
from context. The following lemma establishes necessary
and sufficient conditions for an edge to be deletable.

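Lemma 2. Let $\mathcal{G}$ and $\mathcal{H}$ be two DAGs such that $\mathcal{G} \leq \mathcal{H}$.
An edge $X \rightarrow Y$ is deletable in $\mathcal{H}$ with respect to $\mathcal{G}$ if and
only if $Y \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{Pa}^{\mathcal{H}}_{Y} \setminus X$.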

Proof: Let $\mathcal{H}'$ be the DAG resulting from removing the
edge. The “only if” follows immediately because the given
independence is implied by $\mathcal{H}'$. For the “if”, we show that
for every node $A$ and every node $B \in \mathbf{NonDe}^{\mathcal{H}'}_{A}$, the
independence $A \perp\!\!\!\perp_{\mathcal{G}} B \mid \mathbf{Pa}^{\mathcal{H}'}_{A}$ holds (in $\mathcal{G}$). We need only
consider $(A, B)$ pairs for which $B$ is a descendant of $A$ in $\mathcal{H}$ but not
in $\mathcal{H}'$; if the “descendant” relationship has not changed, we
know the independence holds by virtue of $\mathcal{G} \leq \mathcal{H}$ and the
fact that deleting an edge results in strictly more independence
constraints.

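The proof follows by induction on the length of the longest
directed path in $\mathcal{H}'$ from $Y$ to $B$. For the base case (see
Figure 10a and Figure 10b), we start with a longest path
of length zero; in other words, $B = Y$. Because $A$ is an
ancestor of $Y$ in $\mathcal{H}$, both it and its parents must be
non-descendants of $Y$ in $\mathcal{H}$, and therefore the Markov
conditions in $\mathcal{H}$ imply

$Y \perp\!\!\!\perp_{\mathcal{G}} A, \mathbf{Pa}^{\mathcal{H}}_{A} \mid \mathbf{Pa}^{\mathcal{H}}_{Y}$ (2)

Given the independence fact assumed in the lemma, we can
apply the Contraction axiom to remove $X$ from the
conditioning set in (2), and then apply the Weak Union axiom to
move $\mathbf{Pa}^{\mathcal{H}}_{A}$ into the conditioning set to conclude

$Y \perp\!\!\!\perp_{\mathcal{G}} A \mid \mathbf{Pa}^{\mathcal{H}}_{Y} \setminus X, \mathbf{Pa}^{\mathcal{H}}_{A}$ (3)

Neither $Y$ nor its new parents $\mathbf{Pa}^{\mathcal{H}}_{Y} \setminus X$ can be descendants
of $A$ in $\mathcal{H}'$, else $B$ would remain a descendant of $A$ after the
deletion, and thus we conclude by the Markov conditions
in $\mathcal{H}$ that

$A \perp\!\!\!\perp_{\mathcal{G}} \mathbf{Pa}^{\mathcal{H}}_{Y} \setminus X \mid \mathbf{Pa}^{\mathcal{H}}_{A}$ (4)

Applying the Contraction axiom to (3) and (4), we have

$A \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{Pa}^{\mathcal{H}}_{A}$

and because $\mathbf{Pa}^{\mathcal{H}'}_{A} = \mathbf{Pa}^{\mathcal{H}}_{A}$ the lemma follows.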

For the induction step (see Figure 10c and Figure 10d), we
assume the lemma holds for all nodes whose longest path
from $Y$ is $\leq k$, and we consider a $B$ for which the longest
path from $Y$ is $k + 1$. Consider any parent $P$ of node $B$.
If $P$ is a descendant of $Y$, the longest path from $Y$ to $P$
must be $\leq k$, else we have a path to $B$ that is longer than
$k + 1$. If $P$ is not a descendant of $Y$, then $P$ is also not a
descendant of $A$ in $\mathcal{H}$, else $B$ would be a descendant of $A$
in $\mathcal{H}'$. Thus, for every parent $P$, we conclude

$A \perp\!\!\!\perp_{\mathcal{G}} P \mid \mathbf{Pa}^{\mathcal{H}}_{A}$

either by the induction hypothesis or by the fact that $P$ is a
non-descendant of $A$ in $\mathcal{H}$. From the Composition axiom
we can combine these individual parents together, yielding

$A \perp\!\!\!\perp_{\mathcal{G}} \mathbf{Pa}^{\mathcal{H}}_{B} \mid \mathbf{Pa}^{\mathcal{H}}_{A}$ (5)

Because $B$ is a descendant of $A$ in $\mathcal{H}$, we know that $A$ and
all of its parents $\mathbf{Pa}^{\mathcal{H}}_{A}$ are non-descendants of $B$ in $\mathcal{H}$, and
thus

$B \perp\!\!\!\perp_{\mathcal{G}} A, \mathbf{Pa}^{\mathcal{H}}_{A} \mid \mathbf{Pa}^{\mathcal{H}}_{B}$ (6)



Figure 10: Relevant portions of $\mathcal{H}$ and $\mathcal{H}'$ for the inductive proof of Lemma 2. (a) and (b) are for the basis and (c) and (d)
are for the induction hypothesis.


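Applying the Weak Union axiom to (6) yields

$B \perp\!\!\!\perp_{\mathcal{G}} A \mid \mathbf{Pa}^{\mathcal{H}}_{A}, \mathbf{Pa}^{\mathcal{H}}_{B}$ (7)

and finally applying the Contraction axiom to (5) and (7)
yields

$A \perp\!\!\!\perp_{\mathcal{G}} B \mid \mathbf{Pa}^{\mathcal{H}}_{A}$

Because the parents of $A$ are the same in both $\mathcal{H}$ and $\mathcal{H}'$,
the lemma follows. □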

B.2 Intermediate Result: “The Deletion Theorem” 

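We define the pruned variables for $\mathcal{G}$ and $\mathcal{H}$, denoted
Prune($\mathcal{G}$, $\mathcal{H}$), to be the subset of the variables that remain
after we repeatedly remove from both graphs any common
sink nodes (i.e., nodes with no children) with the same
parents in both graphs. For $\mathbf{V}$ = Prune($\mathcal{G}$, $\mathcal{H}$), let $\overline{\mathbf{V}}$
denote the complement of $\mathbf{V}$. Note that every node in $\overline{\mathbf{V}}$ has
the same parents and children in both $\mathcal{G}$ and $\mathcal{H}$, and that
$\mathcal{G}[\overline{\mathbf{V}}] = \mathcal{H}[\overline{\mathbf{V}}]$.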
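
The pruning operation can be sketched directly from this definition. The code below is a minimal illustration under stated assumptions: each DAG is represented as a dictionary from node to parent set over the same node set, and the function name `prune` and the example graphs are hypothetical.

```python
def prune(parents_g, parents_h):
    """Return the nodes left after repeatedly removing common sinks
    that have the same parent set in both DAGs."""
    remaining = set(parents_g)

    def is_sink(parents, node):
        # A sink has no child among the nodes that are still present.
        return not any(node in parents[v] for v in remaining)

    changed = True
    while changed:
        changed = False
        for node in sorted(remaining):
            if (parents_g[node] == parents_h[node]
                    and is_sink(parents_g, node)
                    and is_sink(parents_h, node)):
                remaining.discard(node)
                changed = True
                break              # restart the scan on the smaller graphs
    return remaining


# G is the chain A -> B -> C -> D; H adds the edge A -> C.
example_g = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"C"}}
example_h = {"A": set(), "B": {"A"}, "C": {"A", "B"}, "D": {"C"}}
pruned = prune(example_g, example_h)
```

In this example only D is removed: it is a sink with parent C in both graphs, whereas C becomes a common sink afterwards but has different parent sets in the two DAGs.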

We use $\mathcal{G}$-leaf to denote any node in $\mathcal{G}$ that has no children.
For any $\mathcal{G}$-leaf $L$, we say that $L$ is an $\mathcal{H}$-lowest $\mathcal{G}$-leaf if no
proper descendant of $L$ in $\mathcal{H}$ is a $\mathcal{G}$-leaf. Note that we are
discussing two DAGs in this case: $L$ is a leaf in $\mathcal{G}$, and out
of all nodes that are leaves in $\mathcal{G}$, $L$ is the one that is lowest
in the other DAG $\mathcal{H}$. To avoid ambiguity, we often prefix
other common graph concepts (e.g., $\mathcal{G}$-child and $\mathcal{H}$-parent)
to emphasize the specific DAG to which we are referring.
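
To make the two-graph definition concrete, the sketch below finds the leaves of one DAG and keeps those with no leaf among their proper descendants in the other DAG. The helper names and the parent-set dictionary representation are assumptions made for illustration only.

```python
def proper_descendants(parents, node):
    """All nodes reachable from `node` by a directed path of length >= 1."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    found, stack = set(), list(children[node])
    while stack:
        v = stack.pop()
        if v not in found:
            found.add(v)
            stack.extend(children[v])
    return found


def h_lowest_g_leaves(parents_g, parents_h):
    """Leaves of G that have no G-leaf among their proper H-descendants."""
    has_child = {p for ps in parents_g.values() for p in ps}
    g_leaves = set(parents_g) - has_child
    return {
        leaf for leaf in g_leaves
        if not (proper_descendants(parents_h, leaf) & g_leaves)
    }


# G: A -> B and A -> C, so B and C are G-leaves. H adds the edge B -> C.
leaf_g = {"A": set(), "B": {"A"}, "C": {"A"}}
leaf_h = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
lowest = h_lowest_g_leaves(leaf_g, leaf_h)
```

Here B is a leaf of G but not a lowest one, because its descendant C in H is also a leaf of G; C is the only lowest leaf.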


We need the following result from Chickering (2002). 

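Lemma 3. (Chickering, 2002) Let $\mathcal{G}$ and $\mathcal{H}$ be two DAGs
containing a node $Y$ that is a sink in both DAGs and for
which $\mathbf{Pa}^{\mathcal{G}}_{Y} = \mathbf{Pa}^{\mathcal{H}}_{Y}$. Let $\mathcal{G}'$ and $\mathcal{H}'$ denote the subgraphs
of $\mathcal{G}$ and $\mathcal{H}$, respectively, that result by removing node $Y$
and all its in-coming edges. Then $\mathcal{G} \leq \mathcal{H}$ if and only if
$\mathcal{G}' \leq \mathcal{H}'$.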

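By repeatedly applying Lemma 3, the following corollary
follows immediately.

Corollary 1. Let $\mathbf{V}$ = Prune($\mathcal{G}$, $\mathcal{H}$). Then $\mathcal{G} \leq \mathcal{H}$ if and
only if $\mathcal{G}_{\mathbf{V}} \leq \mathcal{H}_{\mathbf{V}}$.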

We now present the “deletion theorem”, which is the basis
for Theorem 2.

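Theorem 4. Let $\mathcal{G}$ and $\mathcal{H}$ be DAGs for which $\mathcal{G} \leq \mathcal{H}$, let
$\mathbf{V}$ = Prune($\mathcal{G}$, $\mathcal{H}$), and let $L$ be any $\mathcal{H}[\mathbf{V}]$-lowest
$\mathcal{G}[\mathbf{V}]$-leaf. Then,

1. If $L$ does not have any $\mathcal{H}[\mathbf{V}]$-children, then for every
$D \in \mathbf{V}$ that is an $\mathcal{H}[\mathbf{V}]$-parent of $L$ but not a
$\mathcal{G}[\mathbf{V}]$-parent of $L$, $D \rightarrow L$ is deletable in $\mathcal{H}$.

2. If $L$ has at least one $\mathcal{H}[\mathbf{V}]$-child, let $A$ be any
$\mathcal{H}[\mathbf{V}]$-highest child; one of the following three properties
must hold in $\mathcal{H}$:

(a) $L \rightarrow A$ is covered.

(b) There exists an edge $A \leftarrow B$, where $L$ and $B$ are
not adjacent, and either $L \rightarrow A$ or $A \leftarrow B$ (or
both) are deletable.

(c) There exists an edge $D \rightarrow L$, where $D$ and $A$ are
not adjacent, and either $D \rightarrow L$ or $L \rightarrow A$ (or
both) are deletable.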

Proof: As a consequence of Corollary 1, the theorem holds
if and only if it holds for any graphs $\mathcal{G}$ and $\mathcal{H}$ for which
there are no nodes that are sinks in both graphs with the
same parents; in other words, $\mathcal{G} = \mathcal{G}_{\mathbf{V}}$ and $\mathcal{H} = \mathcal{H}_{\mathbf{V}}$.
Thus, to vastly simplify the notation for the remainder of
the proof, we will assume that this is the case, and therefore
$L$ is a leaf node in $\mathcal{G}$, $A$ is a highest child of $L$ in $\mathcal{H}$, and
the restriction of $B$ and $D$ to $\mathbf{V}$ is vacuous.

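For case (1), we know that $\mathbf{Pa}^{\mathcal{G}}_{L} \subseteq \mathbf{Pa}^{\mathcal{H}}_{L}$, else there would
be some edge $X \rightarrow L$ in $\mathcal{G}$ for which $X$ and $L$ are not
adjacent in $\mathcal{H}$, contradicting $\mathcal{G} \leq \mathcal{H}$. Because $L$ is a leaf
in $\mathcal{G}$, all non-parents must also be non-descendants, and
hence $L \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{Pa}^{\mathcal{G}}_{L}$ for all $X$. It follows that for every
$D \in \{\mathbf{Pa}^{\mathcal{H}}_{L} \setminus \mathbf{Pa}^{\mathcal{G}}_{L}\}$, $D \rightarrow L$ is deletable in $\mathcal{H}$. There must
exist such a $D$, else $L$ would be a common sink with the same parents
in both graphs and would have been removed when computing
$\mathbf{V}$ = Prune($\mathcal{G}$, $\mathcal{H}$).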

For case (2), we now show that at least one of the properties
must hold. We assume that the first property does not hold,
and demonstrate that one of the other two properties must
hold. If the first property does not hold then we know that
in $\mathcal{H}$ either there exists an edge $A \leftarrow B$ where $B$ is not a
parent of $L$, or there exists an edge $D \rightarrow L$ where $D$ is not
a parent of $A$. Thus the pre-conditions of at least one of the
remaining two properties must hold.

Suppose $\mathcal{H}$ contains the edge $A \leftarrow B$ where $B$ is not a
parent of $L$. Then we conclude immediately from Corollary 2
that either $L \rightarrow A$ or $A \leftarrow B$ is deletable in $\mathcal{H}$.

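Suppose $\mathcal{H}$ contains the edge $D \rightarrow L$ where $D$ is not a
parent of $A$. Then the set $\mathbf{D}$ containing all parents of $L$ that
are not parents of $A$ is non-empty. Let $\mathbf{R} = \mathbf{Pa}^{\mathcal{H}}_{L} \cap \mathbf{Pa}^{\mathcal{H}}_{A}$
be the shared parents of $L$ and $A$, and let $\mathbf{T} = \mathbf{Pa}^{\mathcal{H}}_{A} \setminus \{\mathbf{R}, L\}$
be the remaining non-$L$ parents of $A$ in $\mathcal{H}$, so that we have
$\mathbf{Pa}^{\mathcal{H}}_{A} = L, \mathbf{R}, \mathbf{T}$ and $\mathbf{Pa}^{\mathcal{H}}_{L} = \mathbf{R}, \mathbf{D}$. Because no node in
$\mathbf{D}$ is a child or a descendant of $A$, lest $\mathcal{H}$ contains a cycle,
we know that $\mathcal{H}$ contains the following independence
constraint that must hold in $\mathcal{G}$:

$A \perp\!\!\!\perp_{\mathcal{G}} \mathbf{D} \mid L, \mathbf{R}, \mathbf{T}$ (8)

Because $L$ is a leaf node in $\mathcal{G}$, it is impossible to create a
new active path by removing it from the conditioning set,
and hence we also know

$A \perp\!\!\!\perp_{\mathcal{G}} \mathbf{D} \mid \mathbf{R}, \mathbf{T}$ (9)

Applying the Weak Transitivity axiom to Independence 8
and Independence 9, we conclude either $A \perp\!\!\!\perp_{\mathcal{G}} L \mid \mathbf{R}, \mathbf{T}$
(in which case $L \rightarrow A$ is deletable) or

$L \perp\!\!\!\perp_{\mathcal{G}} \mathbf{D} \mid \mathbf{R}, \mathbf{T}$ (10)

We know that no node in $\mathbf{T}$ can be a descendant of $L$, or
else $A$ would not be the highest child of $L$. Thus, because
$L$ is independent of any non-descendants given its parents
we have

$L \perp\!\!\!\perp_{\mathcal{G}} \mathbf{T} \mid \mathbf{R}, \mathbf{D}$ (11)

Applying the Intersection axiom to Independence 10 and
Independence 11, we have

$L \perp\!\!\!\perp_{\mathcal{G}} \mathbf{D} \mid \mathbf{R}$ (12)

In other words, $L$ is independent of all of the nodes in $\mathbf{D}$
given the other parents. By applying the Weak Union axiom,
we can pull all but one of the nodes in $\mathbf{D}$ into the
conditioning set to obtain

$L \perp\!\!\!\perp_{\mathcal{G}} D \mid \mathbf{R}, \{\mathbf{D} \setminus D\}$ (13)

and hence $D \rightarrow L$ is deletable for each such $D$. □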


B.3 Intermediate Result: “Add A Singleton 
Descendant to the Conditioning Set” 

The intuition behind the following lemma is that if $L$ is
an $\mathcal{H}$-lowest $\mathcal{G}$-leaf, no v-structure below $L$ in $\mathcal{H}$ can be
“real” in terms of the dependences in $\mathcal{G}$: for any $Y$ below $L$
that is independent of some other node $X$, they remain
independent when we condition on any singleton descendant
$Z$ of $Y$, even if $Z$ is also a descendant of $X$. The lemma
is stated in a somewhat complicated manner because we
want to use it both when (1) $X$ and $Y$ are adjacent but the
edge is known to be deletable and (2) $X$ and $Y$ are not
adjacent. We also find it convenient to include, in addition
to $Y$’s non-$X$ parents, an arbitrary additional set of
non-descendants $\mathbf{S}$.

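Lemma 4. Let $Y$ be any $\mathcal{H}$-descendant of an $\mathcal{H}$-lowest
$\mathcal{G}$-leaf. If

$Y \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}, \mathbf{S}$

for $\{X, \mathbf{S}\} \subseteq \mathbf{NonDe}^{\mathcal{H}}_{Y}$, then $Y \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}, \mathbf{S}, Z$ for
any proper $\mathcal{H}$-descendant $Z$ of $Y$.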

Proof: To simplify notation, let $\mathbf{R} = \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}, \mathbf{S}$. Assume
the lemma does not hold and thus $Y \not\perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{R}, Z$. Consider
any $(\mathbf{R}, Z)$-active path $\pi_{XY}$ between $X$ and $Y$ in $\mathcal{G}$.
Because $Y \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{R}$, this path cannot be active without $Z$ in
the conditioning set, which means that $Z$ must be on the
path, and it must be a collider in every position it occurs.
Without loss of generality, assume $Z$ occurs exactly once as
a collider along the path (we can simply delete the sub-path
between the first and last occurrence of $Z$, and the resulting
path will remain active), and let $\pi_{XZ}$ be the sub-path from
$X$ to $Z$ along $\pi_{XY}$, and let $\pi_{ZY}$ be the sub-path from $Z$ to
$Y$ along $\pi_{XY}$.

Because $Z$ is a proper descendant of $Y$ in $\mathcal{H}$, and $Y$ is a
descendant of an $\mathcal{H}$-lowest $\mathcal{G}$-leaf $L$, we know $Z$ cannot be
a $\mathcal{G}$-leaf, else it would be lower than $L$ in $\mathcal{H}$. That means
that in $\mathcal{G}$, there is a directed path $\pi_{ZL'} = Z \rightarrow \cdots \rightarrow L'$
consisting of at least one edge from $Z$ to some $\mathcal{G}$-leaf $L'$. No
node $T$ along this path can be in $\mathbf{R}$, else we could splice
in the path $Z \rightarrow \cdots \rightarrow T \leftarrow \cdots \leftarrow Z$ between $\pi_{XZ}$
and $\pi_{ZY}$, and the resulting path would remain active without
$Z$ in the conditioning set. Note that this means that $L'$
cannot belong to $\mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X} \subseteq \mathbf{R}$. Similarly, the path cannot
reach $X$ or $Y$, else we could combine this out-of-$Z$ path
with $\pi_{ZY}$ or $\pi_{XZ}$, respectively, to again find an $\mathbf{R}$-active
path between $X$ and $Y$. We know that in $\mathcal{H}$, $L'$ must be
a non-descendant of $Y$, else $L'$ would be a lower $\mathcal{G}$-leaf
than $L$ in $\mathcal{H}$. Because $X \cup \mathbf{R}$ contains all of $Y$’s parents
and none of its descendants, and because (as we noted) $L'$
cannot be in $X \cup \mathbf{R}$, we know $\mathcal{H}$ contains the independence
$Y \perp\!\!\!\perp_{\mathcal{H}} L' \mid X, \mathbf{R}$. But we just argued that the (directed) path
$\pi_{ZL'}$ in $\mathcal{G}$ does not pass through any of $X$, $Y$, $\mathbf{R}$, which
means that it constitutes an out-of-$Z$ $(\mathbf{R}, X)$-active path
that can be combined with $\pi_{ZY}$ to produce a $(\mathbf{R}, X)$-active
path between $Y$ and $L'$, yielding a contradiction. □

B.4 Intermediate Result: The “Weak-Transitivity 
Deletion” Lemma 

The next lemma considers a collider $X \rightarrow Z \leftarrow Y$ in
$\mathcal{H}$ where either there is no edge between $X$ and $Y$ (i.e.,
the collider is a v-structure) or the edge is deletable. The
lemma states that if $X$ and $Y$ remain independent when
conditioning on their common child (where all the
non-$\{X, Y, Z\}$ parents of all three nodes are also in the
conditioning set) then one of the two edges must be deletable.

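Lemma 5. Let $X \rightarrow Z$ and $Y \rightarrow Z$ be two
edges in $\mathcal{H}$. If $X \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{Pa}^{\mathcal{H}}_{X \downarrow Y}, \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}, \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow XY}$ and
$X \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{Pa}^{\mathcal{H}}_{X \downarrow Y}, \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}, \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow XY}, Z$ (i.e., $Z$ added to
the conditioning set), then at least one of the following must
hold: $Z \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow X}$ or $Z \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow Y}$.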

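Proof: Let $\mathbf{S} = \{\mathbf{Pa}^{\mathcal{H}}_{X \downarrow Y}, \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}\} \setminus \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow XY}$ be the
(non-$X$ and non-$Y$) parents of $X$ and $Y$ that are not parents of
$Z$, and let $\mathbf{R} = \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow XY}$ be all of $Z$’s parents other than
$X$ and $Y$. Using this notation, we can re-write the two
conditions of the lemma as:

$X \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{R}, \mathbf{S}$ (14)

and

$X \perp\!\!\!\perp_{\mathcal{G}} Y \mid Z, \mathbf{R}, \mathbf{S}$ (15)

From the Weak Transitivity axiom we conclude from
these two independences that either $Z \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{R}, \mathbf{S}$ or
$Z \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{R}, \mathbf{S}$. Assume the first of these is true

$Z \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{R}, \mathbf{S}$ (16)

If we apply the Composition axiom to the independences
in Equation 14 and Equation 16 we get $X \perp\!\!\!\perp_{\mathcal{G}} Y, Z \mid \mathbf{R}, \mathbf{S}$;
applying the Weak Union axiom we can then pull $Y$ into
the conditioning set to get:

$Z \perp\!\!\!\perp_{\mathcal{G}} X \mid \{Y, \mathbf{R}\}, \mathbf{S}$ (17)

Because $\{Y, \mathbf{R}\}, X$ is precisely the parent set of $Z$, and
because $\mathbf{S}$ (i.e., the parents of $Z$’s parents) cannot contain
any descendant of $Z$, we know by the Markov conditions
that

$Z \perp\!\!\!\perp_{\mathcal{G}} \mathbf{S} \mid \{Y, \mathbf{R}\}, X$ (18)

Applying the Intersection axiom to the independences in
Equation 17 and Equation 18 yields:

$Z \perp\!\!\!\perp_{\mathcal{G}} X \mid Y, \mathbf{R}$

Because $Y, \mathbf{R} = \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow X}$, this means the first independence
implied by the lemma follows.

If the second of the two independence facts that follow
from Weak Transitivity holds (i.e., if $Z \perp\!\!\!\perp_{\mathcal{G}} Y \mid \mathbf{R}, \mathbf{S}$), then a
completely parallel application of axioms leads to the second
independence implied by the lemma. □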

B.5 Intermediate Result: The “Move Lower” Lemma 

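Lemma 6. Let $Y$ be any $\mathcal{H}$-descendant of an $\mathcal{H}$-lowest
$\mathcal{G}$-leaf. If there exists an $X \in \mathbf{NonDe}^{\mathcal{H}}_{Y}$ that has a common
$\mathcal{H}$-descendant with $Y$ and for which

$Y \perp\!\!\!\perp_{\mathcal{G}} X \mid \mathbf{Pa}^{\mathcal{H}}_{Y \downarrow X}$

then there exists an edge $W \rightarrow Z$ that is deletable in $\mathcal{H}$,
where $Z$ is a proper $\mathcal{H}$-descendant of $Y$.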

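Proof: Let $Z$ be the highest common descendant of $Y$ and
$X$, let $D_Y$ be the lowest descendant of $Y$ that is a parent
of $Z$, and let $D_X$ be any descendant of $X$ that is a parent
of $Z$. We know that either (1) $D_Y = Y$ and $D_X = X$ or
(2) $D_Y$ and $D_X$ are not adjacent and have no directed path
connecting them; if this were not the case, and $\mathcal{H}$ contained
a path $D_Y \rightarrow \cdots \rightarrow D_X$ ($D_X \rightarrow \cdots \rightarrow D_Y$) then $D_X$
($D_Y$) would be a higher common descendant than $Z$. This
means that in either case (1) or in case (2), we have

$D_Y \perp\!\!\!\perp_{\mathcal{G}} D_X \mid \mathbf{Pa}^{\mathcal{H}}_{D_Y \downarrow D_X}$ (19)

For case (1), this is given to us explicitly in the statement
of the lemma, and for case (2), $D_X$ is not a parent of $D_Y$, so
$\mathbf{Pa}^{\mathcal{H}}_{D_Y \downarrow D_X} = \mathbf{Pa}^{\mathcal{H}}_{D_Y}$, and
thus the independence holds from the Markov conditions
in $\mathcal{H}$ because $D_X$ is a non-descendant of $D_Y$. Because in
both cases we know there is no directed path from $D_Y$ to
$D_X$, we know that all of $\mathbf{Pa}^{\mathcal{H}}_{D_X \downarrow D_Y}$ are non-descendants of
$D_Y$, and thus we can add them (via Composition and Weak
Union) to the conditioning set of Equation 19:

$D_Y \perp\!\!\!\perp_{\mathcal{G}} D_X \mid \mathbf{Pa}^{\mathcal{H}}_{D_Y \downarrow D_X}, \mathbf{Pa}^{\mathcal{H}}_{D_X \downarrow D_Y}$ (20)

For any $P_Z \in \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow D_Y D_X}$ (i.e., any parent of $Z$ excluding
$D_Y$ and $D_X$), we know that $P_Z$ cannot be a descendant of
$D_Y$, else $P_Z$ would have been chosen instead of $D_Y$ as the
lowest descendant of $Y$ that is a parent of $Z$. Thus, we can
yet again add to the conditioning set (via Composition and
Weak Union) to get:

$D_Y \perp\!\!\!\perp_{\mathcal{G}} D_X \mid \mathbf{Pa}^{\mathcal{H}}_{D_Y \downarrow D_X}, \mathbf{Pa}^{\mathcal{H}}_{D_X \downarrow D_Y}, \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow D_Y D_X}$ (21)

Because no member of the conditioning set in Equation 21
is a descendant of $D_Y$, and because $D_Y$, by virtue of being
a descendant of $Y$, must also be a descendant of the
$\mathcal{H}$-lowest $\mathcal{G}$-leaf, we conclude from Lemma 4 that for (proper
$\mathcal{H}$-descendant of $Y$) $Z$ we have:

$D_Y \perp\!\!\!\perp_{\mathcal{G}} D_X \mid \mathbf{Pa}^{\mathcal{H}}_{D_Y \downarrow D_X}, \mathbf{Pa}^{\mathcal{H}}_{D_X \downarrow D_Y}, \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow D_Y D_X}, Z$ (22)

Given Equation 21 and Equation 22, we can apply Lemma 5
and conclude either (1) $Z \perp\!\!\!\perp_{\mathcal{G}} D_Y \mid \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow D_Y}$ and hence
$D_Y \rightarrow Z$ is deletable in $\mathcal{H}$ or (2) $Z \perp\!\!\!\perp_{\mathcal{G}} D_X \mid \mathbf{Pa}^{\mathcal{H}}_{Z \downarrow D_X}$ and
hence $D_X \rightarrow Z$ is deletable in $\mathcal{H}$. □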

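Corollary 2. Let $L$ be an $\mathcal{H}$-lowest $\mathcal{G}$-leaf, and let $A$ be
any $\mathcal{H}$-highest child of $L$. If there exists an edge $A \leftarrow B$ in
$\mathcal{H}$ for which $L$ and $B$ are not adjacent, then either $L \rightarrow A$
or $A \leftarrow B$ is deletable in $\mathcal{H}$.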

Proof: Because $L$ is equal to (and thus a descendant of)
an $\mathcal{H}$-lowest $\mathcal{G}$-leaf, it satisfies the requirement for “$Y$” in
the statement of Lemma 6. Because $A$ is the highest child
of $L$, $B$ cannot be a descendant of $L$ and thus satisfies the
requirement of “$X$” in the statement of Lemma 6. From
the proof of the lemma, if we choose $A$ to be the highest
common descendant (i.e., “$Z$”), the corollary follows by
noting that because $A$ is the highest $\mathcal{H}$-child of $L$, $L$ must
be a lowest parent of $A$, and thus we can choose $D_Y = L$
and $D_X = B$. □

B.6 Intermediate Result: “The Move-Down 
Corollary” 

Corollary 3. Let $X \rightarrow Y$ be any deletable edge within $\mathcal{H}$
for which $Y$ is a descendant of an $\mathcal{H}$-lowest $\mathcal{G}$-leaf. Then
there exists an edge $Z \rightarrow W$ that is deletable in $\mathcal{H}$ for
which $Z$ and $W$ have no common descendants.

Proof: If $X$ and $Y$ have a common descendant, we know
from Lemma 6 that there must be another deletable edge
$Z \rightarrow W$ for which $W$ is a proper descendant of $Y$, and
thus $Z$ and $W$ satisfy the conditions for “$X$” and “$Y$”,
respectively, in the statement of Lemma 6, but with a lower
“$Y$” than we had before. Because $\mathcal{H}$ is acyclic, if we
repeatedly apply this argument we must reach some edge for
which the endpoints have no common descendants. □

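B.7 Main Result: Proof of Theorem 2

Theorem 2. If $\mathcal{G} < \mathcal{C}$ for CPDAG $\mathcal{C}$ and DAG $\mathcal{G}$, then
for any EIH property $\Pi$ that holds on $\mathcal{G}$, there exists a
$\Pi$-consistent Delete($X$, $Y$, $\mathbf{H}$) that when applied to $\mathcal{C}$ results
in the CPDAG $\mathcal{C}'$ for which $\mathcal{G} \leq \mathcal{C}'$.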


Proof: Consider any DAG $\mathcal{H}^{0}$ in $[\mathcal{C}]_{\approx}$. From Theorem 4
we know that there exists either a covered edge or deletable
edge in $\mathcal{H}^{0}$; if we reverse any covered edge in DAG $\mathcal{H}^{i}$, the
resulting DAG $\mathcal{H}^{i+1}$ (which is equivalent to $\mathcal{H}^{i}$) will be
closer to $\mathcal{G}$ in terms of total edge differences, and therefore
because $\mathcal{H}^{i} \neq \mathcal{G}$ we must eventually reach an $\mathcal{H} = \mathcal{H}^{j}$ for
which Theorem 4 identifies a deletable edge $e$. The edge $e$
in $\mathcal{H}$ satisfies the preconditions of Corollary 3, and thus we
know that there must also exist a deletable edge $X \rightarrow Y$
in $\mathcal{H}$ for which $X$ and $Y$ have no common descendants in
$\mathcal{H}[\mathbf{V}]$ for $\mathbf{V}$ = Prune($\mathcal{G}$, $\mathcal{H}$).

Let $\mathcal{H}'$ be the DAG that results from deleting the edge
$X \rightarrow Y$ in $\mathcal{H}$. Because there is a GES delete operator
corresponding to every edge deletion in every DAG
in $[\mathcal{C}]_{\approx}$, we know there must be a set $\mathbf{H}$ for which the
operator Delete($X$, $Y$, $\mathbf{H}$), when applied to $\mathcal{C}$, results in
$\mathcal{C}' = [\mathcal{H}']_{\approx}$. Because $X \rightarrow Y$ is deletable in $\mathcal{H}$, the
operator satisfies the IMAP requirement in the theorem. For
the remainder of the proof, we demonstrate that it is
$\Pi$-consistent.

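Because all directed edges in $\mathcal{C}'$ are compelled, these edges
must exist with the same orientation in all DAGs in $[\mathcal{C}']_{\approx}$;
it follows that any subset $\mathbf{W}$ of the common descendants
of $X$ and $Y$ in $\mathcal{C}'$ must also be common descendants of
$X$ and $Y$ in $\mathcal{H}'$. But because $X$ and $Y$ have no common
descendants in the “pruned” subgraph $\mathcal{H}[\mathbf{V}]$, we know that
$\mathbf{W}$ is contained entirely in the complement of $\mathbf{V}$, which
means $\mathcal{H}[\mathbf{W}] = \mathcal{G}[\mathbf{W}]$; because $\mathcal{H}'$ is the same as $\mathcal{H}$ except
for the edge $X \rightarrow Y$, we conclude $\mathcal{H}'[\mathbf{W}] = \mathcal{G}[\mathbf{W}]$.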

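We now consider the induced subgraph $\mathcal{H}'[\mathbf{W} \cup X \cup Y]$
that we get by “expanding” the graph $\mathcal{H}'[\mathbf{W}]$ to include $X$
and $Y$. Because $X$ and $Y$ are not adjacent in $\mathcal{H}'$, and
because $\mathcal{H}'$ is acyclic, any edge in $\mathcal{H}'[\mathbf{W} \cup X \cup Y]$ that is not
in $\mathcal{H}'[\mathbf{W}]$ must be directed from either $X$ or $Y$ into a node
in the descendant set $\mathbf{W}$. Because all nodes in $\mathbf{W}$ are in
the complement of $\mathbf{V}$, these new edges must also exist in
$\mathcal{G}$, and we conclude $\mathcal{H}'[\mathbf{W} \cup X \cup Y] = \mathcal{G}[\mathbf{W} \cup X \cup Y]$.
To complete the proof, we note that because $\Pi$ is hereditary,
it must hold on $\mathcal{H}'[\mathbf{W} \cup X \cup Y]$. From Proposition
??, we know $\mathcal{H}'[\mathbf{W} \cup X \cup Y] \approx \mathcal{C}'[\mathbf{W} \cup X \cup Y]$, and
therefore because $\Pi$ is equivalence invariant, it holds for
$\mathcal{C}'[\mathbf{W} \cup X \cup Y]$. □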